Which surrogate insulin resistance indices best predict coronary artery disease? A machine learning approach

Background Various surrogate markers of insulin resistance have been developed that can predict coronary artery disease (CAD) without the need to measure serum insulin. They depend only on glucose and lipid profiles, as well as anthropometric features. However, there is still no agreement on the most suitable one for predicting CAD. Methods We followed a cohort of 2,000 individuals, ranging in age from 20 to 74, for a duration of 9.9 years. We utilized multivariate Cox proportional hazard models to investigate the association between the TyG-index, TyG-BMI, TyG-WC, TG/HDL, and METS-IR and the occurrence of CAD. The receiver operating characteristic (ROC) curve was employed to compare the predictive efficacy of these indices and their corresponding cutoff values for predicting CAD. We also used three distinct embedded feature selection methods (LASSO, Random Forest feature selection, and the Boruta algorithm) to evaluate and compare surrogate markers of insulin resistance in predicting CAD. In addition, we utilized the ceteris paribus profile on the Random Forest model to illustrate in a diagram how the model's predictions are affected by variations in individual surrogate markers while all other factors are held constant. Results The TyG-index was the only surrogate marker of insulin resistance that demonstrated an association with CAD in the fully adjusted model (HR: 2.54, CI: 1.34–4.81). The association was more prominent in females. Moreover, it demonstrated the highest area under the ROC curve (0.67 [0.63–0.7]) in comparison to other surrogate indices for insulin resistance. All feature selection approaches concurred that the TyG-index is the most reliable surrogate insulin resistance marker for predicting CAD. Based on the ceteris paribus profile of the Random Forest model, the predictive ability of the TyG-index increased steadily with a positive slope once the index exceeded 9, without any decline or leveling off.
Conclusion Due to the simplicity of assessing the TyG-index with routine biochemical assays and given that the TyG-index was the most effective surrogate insulin resistance index for predicting CAD based on our results, it seems suitable for inclusion in future CAD prevention strategies.


Introduction
Globally, cardiovascular diseases (CVDs) continue to significantly impact mortality rates and overall health outcomes [1]. Coronary artery disease (CAD) stands out as the most prevalent type of CVD, exhibiting noticeable increases in its prevalence and incidence across the majority of countries [2]. From 1990 to 2019, the number of deaths and disability-adjusted life years (DALYs) caused by CAD has risen steadily. In 1990, there were around 5 million deaths and 120 million DALYs, but in 2019, there were 9.14 million deaths and 182 million DALYs [2]. This emphasizes the urgent need for precise identification of risk factors to predict and prevent CAD.
Insulin resistance is commonly regarded as one of the key risk factors for predicting CAD [3][4][5]. It is associated with chronic low-grade inflammation [6], which can lead to pro-coagulation states [7], decreased bioavailability of nitric oxide, and subsequently impaired endothelial function [8]. Further, insulin resistance can activate the sympathetic nervous system and reduce vagal activity, resulting in the activation of the renin-angiotensin-aldosterone system and kidney sodium retention, ultimately causing higher blood pressure and cardiovascular damage [9]. Remarkably, despite its considerable importance, it has not been incorporated into any international risk assessment framework for the prediction of CAD [3][4][5][10].
The hyperinsulinemic-euglycemic clamp technique serves as the standard for diagnosing insulin resistance, but its invasiveness, cost, and complexity make it unsuitable for epidemiological studies [11]. The Homeostasis Model Assessment of Insulin Resistance (HOMA-IR) is a commonly employed alternative, offering ease of use; however, this test cannot be used to diagnose people who are already undergoing insulin treatment [12,13]. Additionally, HOMA-IR has another limitation, as laboratories do not routinely measure circulating insulin concentrations [14,15].
In light of the drawbacks of direct measurement of insulin, numerous surrogate markers, based on glucose and lipid profiles as well as some anthropometric features, have emerged. These surrogate markers do not necessitate the measurement of serum insulin levels, and they have an even better correlation with the hyperinsulinemic-euglycemic clamp method compared to HOMA-IR [16][17][18]. The ratio of triglycerides to high-density lipoprotein cholesterol (TG/HDL-C), the triglyceride-glucose index (TyG-index), the TyG-index with body mass index (TyG-BMI), the TyG-index with waist circumference (TyG-WC), and the metabolic score for insulin resistance (METS-IR) are the most common of these simpler and more practical markers [19,20]. Although prior studies have shown associations between these indices and CAD, there is no specific threshold for utilizing these indices, and it remains uncertain which one of them better predicts CAD [21][22][23].
Determining the most reliable predictor among these comparable indices poses a significant challenge in clinical environments, where they can aid in screening and preventive measures to reduce CAD. In this regard, in addition to the conventional statistical methods, we have decided to employ embedded feature selection techniques, which involve the fusion of machine learning algorithms with the process of selecting features [22,23]. The main advantage of these machine learning algorithms over traditional statistical methods is their reduced emphasis on hypothesis-driven inference [24,25]. Instead, they prioritize predictive accuracy and can algorithmically derive covariate interactions [24,26]. These characteristics enable us to evaluate the impact of each feature on CAD prediction comprehensively.
To determine which of these indices best predict CAD occurrence, we first investigated the association between different surrogate markers of insulin resistance and CAD in a 10-year prospective cohort study. Then, we evaluated the optimal cut-off points for these surrogate markers as CAD prediction tools. The ultimate objective was to develop embedded feature selection machine learning algorithms for CAD prediction and to compare the unique impacts of insulin resistance markers on CAD prediction.

Study population
Data for this cohort study were derived from the Yazd Healthy Heart Project (YHHP), an epidemiological study investigating cardiovascular and metabolic illnesses in a population-based setting. In summary, a total of 2000 Iranian adults (1000 men and 1000 women) between the ages of 20 and 74 were selected using a cluster random sampling technique. The participants were recruited from the urban population of Yazd city during the period of 2005-2006 [27].

Inclusion and exclusion criteria
From the 2000 participants, 17 were omitted from the study due to loss during the second phase; from the 1983 individuals participating in the baseline examination, 62 were excluded due to diagnosis of CAD at baseline, 78 due to death during the study, and 312 due to missing data. The remaining 1531 participants (791 men, mean age 48.6 ± 14.7 years) were included in the present study (Fig. 1).

Biochemical analyses
Laboratory analyses were conducted following an overnight fast. Glucose and triglyceride (TG) levels were measured following centrifugation using kits obtained from Pars Azmoon Inc. (Tehran, Iran). The lipid profiles, including total cholesterol, low-density lipoprotein (LDL), and high-density lipoprotein (HDL), were examined using Bionic kits manufactured by Bionic Company (Tehran, Iran). The tests were conducted utilizing a biochemical autoanalyzer (BT 3000, Italy). The key exposure variables of interest were calculated using the following equations [18]:
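The equations themselves are not reproduced in this text. As a minimal sketch, the indices can be computed from their commonly published definitions, which are assumed here to match those of reference [18]: TyG-index = ln(TG × FPG / 2), TyG-BMI = TyG-index × BMI, TyG-WC = TyG-index × WC, TG/HDL = TG / HDL-C, and METS-IR = ln(2 × FPG + TG) × BMI / ln(HDL-C). Units are assumed to be mg/dL for TG, fasting plasma glucose (FPG), and HDL-C, kg/m² for BMI, and cm for WC. The function names and the example values are illustrative only.

```python
import math


def tyg_index(tg, fpg):
    """TyG-index = ln(TG [mg/dL] * FPG [mg/dL] / 2) (assumed standard definition)."""
    return math.log(tg * fpg / 2.0)


def tyg_bmi(tg, fpg, bmi):
    """TyG-BMI = TyG-index * BMI [kg/m^2]."""
    return tyg_index(tg, fpg) * bmi


def tyg_wc(tg, fpg, wc):
    """TyG-WC = TyG-index * waist circumference [cm]."""
    return tyg_index(tg, fpg) * wc


def tg_hdl_ratio(tg, hdl):
    """TG/HDL = TG [mg/dL] / HDL-C [mg/dL]."""
    return tg / hdl


def mets_ir(tg, fpg, hdl, bmi):
    """METS-IR = ln(2 * FPG + TG) * BMI / ln(HDL-C)."""
    return math.log(2.0 * fpg + tg) * bmi / math.log(hdl)


# Made-up values for one hypothetical participant (not study data).
tg, fpg, hdl, bmi, wc = 150.0, 100.0, 50.0, 25.0, 90.0
print("TyG-index:", round(tyg_index(tg, fpg), 2))
print("TyG-BMI:  ", round(tyg_bmi(tg, fpg, bmi), 1))
print("TyG-WC:   ", round(tyg_wc(tg, fpg, wc), 1))
print("TG/HDL:   ", round(tg_hdl_ratio(tg, hdl), 2))
print("METS-IR:  ", round(mets_ir(tg, fpg, hdl, bmi), 1))
```

Because TyG-BMI, TyG-WC, and METS-IR multiply a log-transformed biochemical term by an anthropometric measure, their scale is dominated by BMI or WC, which is worth keeping in mind when their cut-off values are compared with those of the TyG-index.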

Anthropometric features
The participants' heights were measured with a stadiometer attached to a smooth wall with no dents or irregularities. They stood barefoot, with their heels, hips, shoulders, and heads touching the wall and fixed horizontally. The heights were measured with a 0.5 centimeter margin of error. Participants were weighed with minimal clothing on a digital scale (Seca, Germany). The participants' weight was measured with precision to the nearest 0.1 kg in both phases. The circumferences of the waist and hips were measured using a non-stretchable tape at the superior border of the iliac crest and the widest part of the buttock, respectively.

Blood pressure measurements
The participants' right arm blood pressure was measured in a sitting position using an Omron M6 Comfort digital automatic blood pressure monitor. Nursing staff measured blood pressure twice, with a five-minute interval between measurements.

Physical activity, family history of premature CAD, smoking, and education
Trained interviewers utilized questionnaires to gather information on demographics, physical activity, smoking habits, family history of premature CAD, and angina pectoris. The assessment of physical activity was conducted using the International Physical Activity Questionnaire (IPAQ) [28]. As part of this survey, the participants were questioned about the duration and number of days of their walking, engagement in moderate-intensity exercise, and strenuous activity. Based on these responses, the number of metabolic equivalent (MET) hours per week was computed, where one MET is equivalent to 1 kcal/kg/hr [29]. Using this metric, the participants were categorized into low-, moderate-, and high-activity groups. Based on current smoking habits, the participants were categorized into two groups: smokers and nonsmokers. Family history of premature CAD was defined by the occurrence of CAD in a mother or sister before the age of 55, or in a father or brother before the age of 45.
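As an illustration of the MET-hour calculation, the sketch below sums activity-specific MET values weighted by weekly duration. The MET values (3.3 for walking, 4.0 for moderate activity, 8.0 for vigorous activity) are those of the IPAQ short-form scoring protocol and are an assumption here, since the text does not list the values used; the thresholds for the low-, moderate-, and high-activity groups are likewise not stated and are therefore not reproduced.

```python
# MET values assumed from the IPAQ short-form scoring protocol (not given in the text).
IPAQ_MET = {"walking": 3.3, "moderate": 4.0, "vigorous": 8.0}


def met_hours_per_week(activities):
    """Return weekly MET-hours.

    `activities` maps an activity type to (days per week, minutes per day).
    """
    total_met_minutes = 0.0
    for name, (days, minutes) in activities.items():
        total_met_minutes += IPAQ_MET[name] * days * minutes
    return total_met_minutes / 60.0


# Hypothetical questionnaire responses for one participant.
week = {"walking": (5, 30), "moderate": (2, 60), "vigorous": (1, 30)}
print(round(met_hours_per_week(week), 2))
```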

Outcome definition
CAD events were identified based on medical records documenting occurrences of fatal or nonfatal CAD, myocardial infarction, coronary artery bypass graft, positive exercise tests, positive cardiac enzymes, and positive percutaneous coronary angiography. In addition, all participants completed the Rose angina questionnaire (RAQ) [30], a validated tool for assessing new angina. The participants also had electrocardiograms (ECG), which were reviewed by both a general practitioner and a trained nurse. If any discrepancies arose, a cardiologist confirmed the findings. In addition to events identified from medical records, participants with a positive RAQ together with ischemic findings on the ECG were classified as having CAD.

Statistical analysis
SPSS version 27.0 (IBM Corp., Armonk, NY, USA), Python 3, and R version 4.2.2 (www.R-project.org) were used for statistical analysis. Continuous variables were described as mean ± standard deviation (SD) and compared by ANOVA. Categorical variables were described as numbers (percentages) and compared using chi-square tests. We employed multivariable Cox proportional hazard models to assess the association between quartiles of these indices and CAD incidence. We employed two multivariable models for adjustment. Model 1 was adjusted for age and sex, whereas model 2 was adjusted for model 1 plus systolic and diastolic blood pressure, total cholesterol, LDL, HDL, BMI, waist to hip ratio, family history of premature CAD, physical activity, and smoking. If any of these factors were included in the exposure variables (surrogate insulin resistance indices), we excluded them from the adjustment process. For instance, when analyzing the TG/HDL ratio, we did not incorporate HDL into the statistical model.
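The two preparatory steps described above, assigning participants to quartiles of an index and dropping covariates that are already part of that index, can be sketched as follows. The mapping from each index to its overlapping covariates is an assumption: only the TG/HDL example is given in the text, and whether the waist to hip ratio was dropped for TyG-WC is not stated. The quartile boundaries use the default method of Python's `statistics.quantiles`, which may differ slightly from the method used in SPSS or R.

```python
import statistics

MODEL2_COVARIATES = [
    "age", "sex", "sbp", "dbp", "total_cholesterol", "ldl", "hdl",
    "bmi", "waist_hip_ratio", "family_history", "physical_activity", "smoking",
]

# Assumed overlap between each index and the model 2 covariates.
INDEX_COMPONENTS = {
    "TyG": set(),
    "TyG-BMI": {"bmi"},
    "TyG-WC": set(),
    "TG/HDL": {"hdl"},
    "METS-IR": {"bmi", "hdl"},
}


def adjustment_set(index_name):
    """Model 2 covariates with the components of the given index removed."""
    drop = INDEX_COMPONENTS[index_name]
    return [c for c in MODEL2_COVARIATES if c not in drop]


def quartile_labels(values):
    """Label each value 1-4 according to the sample quartile it falls in."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    labels = []
    for v in values:
        if v <= q1:
            labels.append(1)
        elif v <= q2:
            labels.append(2)
        elif v <= q3:
            labels.append(3)
        else:
            labels.append(4)
    return labels


print(adjustment_set("TG/HDL"))
print(quartile_labels([8.1, 8.4, 8.6, 8.9, 9.0, 9.2, 9.4, 9.7]))
```

The quartile labels would then enter the Cox model as a categorical exposure with the first quartile as the reference, alongside the adjustment set returned for that index.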
We employed the receiver operating characteristic (ROC) curve to compare the predictive performance of all indices relative to one another. Then, we determined the optimal cutoff points of the surrogate insulin resistance indices for predicting CAD using the "OptimalCutpoints" R package [31], based on simultaneously maximized sensitivity and specificity, maximum negative and positive diagnostic ratios, and the maximum Youden index. In addition, we categorized these thresholds according to gender.
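To make two of these quantities concrete, the sketch below computes the area under the ROC curve as the probability that a randomly chosen case has a higher index value than a randomly chosen non-case, and finds the cutoff that maximizes the Youden index J = sensitivity + specificity − 1. It is a plain-Python illustration on made-up values, not a reimplementation of the "OptimalCutpoints" package; classifying values at or above the cutoff as positive is an assumption.

```python
def roc_auc(scores, labels):
    """AUC via pairwise comparison of cases (1) and non-cases (0); ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def youden_cutoff(scores, labels):
    """Return (cutoff, sensitivity, specificity) maximizing the Youden index.

    A score >= cutoff is classified as positive; the lowest cutoff wins ties.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for c in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= c)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < c)
        sens, spec = tp / n_pos, tn / n_neg
        j = sens + spec - 1.0
        if best is None or j > best[0]:
            best = (j, c, sens, spec)
    return best[1], best[2], best[3]


# Made-up TyG-index values and CAD outcomes (not study data).
tyg = [8.1, 8.4, 8.6, 8.9, 9.0, 9.2, 9.4, 9.7]
cad = [0, 0, 0, 1, 0, 1, 1, 1]
print(roc_auc(tyg, cad))
print(youden_cutoff(tyg, cad))
```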
In order to choose the best surrogate insulin resistance marker for predicting CAD, we combined integrative methods with an ensemble of different embedded feature selection methods based on machine learning [23]. For the integrative part of our approach, we selected age, sex, systolic blood pressure (SBP), diastolic blood pressure (DBP), LDL, total cholesterol, smoking, family history of premature CAD, and diabetes as our reference variables for comparing our surrogate measures of insulin resistance. For the embedded feature selection part, we first used random forest feature selection, a nonlinear algorithm that can consider multiple interactions and evaluates variables by determining how much each feature reduces impurity (Mean Decrease in Impurity [MDI]) [32]. For the second approach, we employed the Boruta algorithm, which shuffles the values of each feature to create shadow features representing noise or irrelevant features, then trains a random forest model on the original and shadow features and compares their importance over multiple iterations. If a feature is more important than its shadow, it will be selected [33]. As a third approach, we used the least absolute shrinkage and selection operator (LASSO), a regularization technique based on linear regression which drives the coefficients of less important features to zero and selects the variables with non-zero coefficients [34]. We set the alpha (threshold of significance) to 0.05 for this algorithm. Finally, we used the ceteris paribus profile of the random forest model [35,36]. The ceteris paribus profile can graphically depict the effect of altering specific variables on the model's predictions while keeping all other elements unchanged.
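The way LASSO drives weak coefficients exactly to zero can be illustrated with a small coordinate descent solver for the objective (1/2n)·RSS + alpha·Σ|β|. This is a minimal sketch under several assumptions: a linear outcome, centered data with no intercept, and alpha treated as the penalty weight with the value 0.05 taken from the text. The data are constructed for illustration and the function names are hypothetical.

```python
def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; values within [-alpha, alpha] become 0."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=50):
    """Minimize (1/2n) * sum((y - X b)^2) + alpha * sum(|b|) for centered data."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # covariance of feature j with the partial residual
            z = 0.0    # sum of squares of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, alpha) / (z / n)
    return beta


# Constructed example: feature 0 has a strong effect, feature 1 a negligible one.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [2.0 * a + 0.03 * b for a, b in X]
beta = lasso_coordinate_descent(X, y, alpha=0.05)
print(beta)  # the coefficient of the weak feature is exactly 0.0
```

The strong feature keeps a coefficient that is shrunk by roughly the penalty weight, while the weak feature falls inside the threshold and is discarded, which is the selection behavior described above.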

Association of surrogate insulin resistance indices with CAD
Table 1 presents the baseline characteristics of participants according to quartiles of surrogate insulin resistance indices. Age, blood pressure, low education, total cholesterol levels, and LDL showed a significant difference between quartiles for all markers. Table 2 reports the association between different surrogate markers of insulin resistance and CAD incidence. In model 1, after age and sex adjustments, the highest values among all indices in the fourth quartile were significantly and positively associated with CAD. Nevertheless, following adjustment for multiple variables in model 2, only the TyG-index was significantly associated with CAD (hazard ratio [HR]: 2.54, confidence interval [CI]: 1.34-4.81, P value = 0.007, P trend = 0.02). Only the TG/HDL ratio in men (HR: 1.95, CI: 1.01-3.77, P value = 0.04, P trend = 0.07) and the TyG-index in women (HR: 4.76, CI: 1.36-16.66, P value = 0.01, P trend = 0.004) were associated with CAD after final adjustment (Table 3).
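For readers less familiar with Cox model output, the reported hazard ratio and its interval can be related to the underlying regression coefficient. The sketch below assumes a 95% Wald interval, which is symmetric on the log scale; the confidence level and interval type are not stated in the text, so this is an assumption. It recovers the log hazard ratio and its approximate standard error from the published values for the TyG-index and then rebuilds the interval as a consistency check.

```python
import math


def log_hr_and_se(hr, ci_low, ci_high, z=1.96):
    """Log hazard ratio and approximate standard error from an HR and its Wald CI."""
    beta = math.log(hr)
    se = (math.log(ci_high) - math.log(ci_low)) / (2.0 * z)
    return beta, se


# Values reported for the fourth quartile of the TyG-index in model 2.
beta, se = log_hr_and_se(2.54, 1.34, 4.81)

# Rebuild the interval from beta and se; it should match the reported CI closely.
rebuilt = (math.exp(beta - 1.96 * se), math.exp(beta + 1.96 * se))
print(round(beta, 3), round(se, 3))
print(tuple(round(v, 2) for v in rebuilt))
```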
Table 4 presents the area under the ROC curve (AUC) and cut-off points for all indices used to predict CAD in men, women, and the total sample. The TyG-index demonstrated superior predictive performance in both the total sample and among women, with AUC values of 0.67 (0.63-0.70, P value 0.001) and 0.72 (0.66-0.77), respectively. However, the TyG-index and the TyG-WC revealed almost identical performance in men.
Figure 2 illustrates several feature selection methods and the ceteris paribus profile of a random forest model. Figure 2A indicates the feature selection process using the Boruta algorithm. According to this algorithm, age, SBP, and TyG-index were the most important variables for predicting CAD. The random forest model revealed that, following age, blood pressure, and sex, the TyG-index exhibited the greatest MDI, thus serving as the most effective surrogate measure of insulin resistance for predicting CAD (Fig. 2B).
Figure 2C depicts the LASSO technique, which is a penalized approach that discards redundant variables. The TyG-index was the only surrogate indicator of insulin resistance that was chosen by LASSO. The ceteris paribus profile of a random forest model is shown in Figure 2D. Compared to other indices, the TyG-index had a stronger positive slope without a clear plateau or decline.
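A ceteris paribus profile of this kind can be sketched in a few lines: one observation is copied repeatedly, a single feature is moved across a grid of values, and the model's prediction is recorded at each step. The example below is a minimal illustration in which `toy_model` is a hypothetical stand-in for a fitted random forest; its coefficients are invented and do not reflect the study's results.

```python
import math


def ceteris_paribus_profile(predict, observation, feature, grid):
    """Predictions for one observation as `feature` varies over `grid`, all else fixed."""
    profile = []
    for value in grid:
        modified = dict(observation)  # copy so the original observation is untouched
        modified[feature] = value
        profile.append((value, predict(modified)))
    return profile


def toy_model(row):
    """Hypothetical risk model standing in for a trained random forest."""
    score = 0.04 * (row["age"] - 50) + 1.2 * (row["tyg"] - 9.0)
    return 1.0 / (1.0 + math.exp(-score))


person = {"age": 50, "tyg": 8.5}
grid = [8.0, 8.5, 9.0, 9.5, 10.0]
profile = ceteris_paribus_profile(toy_model, person, "tyg", grid)
for value, risk in profile:
    print(value, round(risk, 3))
```

Plotting the recorded predictions against the grid gives the curve whose slope, plateaus, and declines are compared across indices.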

Discussion
Our research findings demonstrated that the TyG-index is the most effective surrogate marker of insulin resistance for predicting CAD and that it has superior predictive capabilities in women. Not only did traditional statistical methods like Cox hazard regression and ROC analysis show that the TyG-index had a better HR and AUC for CAD compared to other surrogate indicators of insulin resistance, but advanced feature selection techniques also further validated these findings.
Surrogate insulin resistance markers encompass both blood glucose and dyslipidemia markers, serving as indirect indications of insulin resistance in the liver and adipose tissue [37]. Furthermore, some of these surrogate markers, including TyG-WC, TyG-BMI, and METS-IR, integrate obesity measures. This approach is grounded in the understanding that a direct relationship exists between insulin resistance and the majority of obesity indicators [38]. The advantage of these non-insulin-dependent surrogate measures of insulin resistance, compared to insulin-dependent competitors such as HOMA-IR, lies in their cost-effective and simplified acquisition technique, as well as their stronger association with the gold standard protocol for measuring insulin resistance [11][12][13]. Furthermore, research indicates that some of these indices may be more effective predictors of CAD than metabolic syndrome, which itself is a reflection of insulin resistance [39].
The findings from meta-analyses have shown a relationship between the TyG-index [40] and the TG/HDL-C ratio [41] with CAD. Additionally, cohort studies have demonstrated the association of TyG-BMI and METS-IR with CAD [19,42,43], while only a cross-sectional study has highlighted a link between TyG-WC and CAD [19]. In the current study, TyG-BMI and METS-IR were not associated with CAD and were also found to be the least effective surrogate markers in the feature selection approaches. A potential explanation is that BMI fluctuations alone, as the sole anthropometric characteristic, fail to accurately indicate the risk of CAD when combined with insulin resistance-related traits [44,45]. Although TyG-WC was the second most reliable indicator after the TyG-index in the present study, we found no significant association between it and CAD.
To date, only four studies have directly compared surrogate markers of insulin resistance and their association with CAD within a single analytical framework [19][20][21][46]. Among these, a case-control study highlighted the METS-IR index as more closely associated with CAD than both the TG/HDL and TyG-index, though this conclusion might be affected by Berkson's bias due to the selection process, which targeted participants who were suspected of having CAD and underwent coronary angiography [20]. Elsewhere, an analysis of cross sectional data from CAD requires considering complex interactions among several parameters [23], a consideration that is overlooked in traditional techniques.

Embedded feature selection
Embedded feature selection techniques are supervised learning dimension reduction techniques used to identify the optimal variables for predicting an outcome [53]. Not only do they enhance the performance and cost-effectiveness of predictive models [54], they can also help healthcare practitioners select the most appropriate variable from a set of variables that carry similar, overlapping information for the purpose of screening and preventing an outcome. Although there is no flawless integrated feature selection algorithm [55], we can combine these strategies to use their respective advantages and mitigate their limitations [56].
Nevertheless, it is important to acknowledge that the decision between using novel techniques such as machine learning and traditional statistical models in predictive analytics is not a clear-cut one. Traditional statistical models offer a transparent depiction of the data, often including a probabilistic framework, which enhances interpretability. These models highlight relevant variables and quantify the strength as well as the significance of associations. Conversely, machine learning models tend to be more empirical, prioritizing predictive performance over interpretability. Previous research has indicated that combining conventional statistical techniques with machine learning is the optimal strategy for reaching generalizable and significant findings [57]. This is why we employed both of these methods to achieve a more comprehensive interpretation of our data.
The ensemble of feature selection approaches in the current study indicated that the TyG-index is the best surrogate marker of insulin resistance for predicting CAD. Following that, the TyG-WC may have the greatest influence. The ceteris paribus profile of the random forest model demonstrated that the predictive capability of the TyG-index grew with a positive slope beyond a value of 9, without any decline or flattening out, which was in accordance with the cutoff points of the ROC curve. The TyG-BMI and METS-IR curves displayed a consistently flat and negative slope, while the TG/HDL and TyG-WC curves showed several instances of plateauing or decline, suggesting that they are not reliable indicators for predicting CAD.
The combination of all three embedded feature selection methods, along with the results of Cox hazard models and ROC curve analysis, demonstrated that the TyG-index is the most reliable surrogate insulin resistance index for predicting CAD. This agreement among different methods demonstrates the stability and reproducibility of the result, thereby increasing confidence in the use of this index [57,58] for CAD prediction.

Strengths and limitations
This study is the first to evaluate and compare the most common surrogate measures of insulin resistance within a unified framework for the prediction of CAD. The prospective, community-based structure of our study helps to limit the likelihood of reverse causation and recall bias. Unlike previous studies [19], we employed a consistent approach to define CAD by examining both paraclinical and symptomatic data. This enabled us to reduce the likelihood of misclassification.
This study also had some limitations. The small number of follow-up sessions constrained our ability to assess and account for voluntary health check-ups and lifestyle modifications that may have influenced our findings over the ten-year study period. Further, because the surrogate insulin resistance indices were evaluated only once at baseline, our results may be influenced by within-individual variation over time. Above all, our study was conducted at a single center and included only individuals from the Iranian population. Thus, it is important to note that our findings may not be generalizable to populations in other countries.

Conclusion
The findings of the present investigation indicate that the TyG-index is the most efficient surrogate insulin resistance index for predicting and preventing CAD. Given the ease of evaluating the TyG-index using routine biochemical tests, incorporating this tool into clinical screenings and including it in future CAD risk assessment scores can greatly enhance healthcare professionals' ability to manage and lower the risk of CAD. Nevertheless, more research involving multiple centers and diverse ethnic groups is necessary to validate our results.

Fig. 1 Flow diagram of participants attending the 10-year follow-up study. a Coronary artery disease

Fig. 2 Ensemble of embedded feature selection methods. A Importance of variables based on their rank in the Boruta method. A lower rank indicates greater importance, while a higher rank indicates lesser importance. The variables highlighted in black are the most important ones. B The mean decrease in impurity (MDI) or Gini importance measures the extent to which every feature contributes to accurate predictions. A higher MDI value indicates that the variable is more important. C LASSO is a regularization approach based on linear regression. Regularization approaches penalize large coefficients because their presence can lead to overfitting. LASSO decreases the coefficients of less significant features to zero and selects features whose coefficients have not been reduced to zero. A higher coefficient indicates greater importance. D The ceteris paribus profile examines individual features while holding all other components of the model constant, in order to understand the particular impact of different features on predictions in machine learning models. A sharper incline on the diagram without a plateau or a downward slope with a higher constant indicates a better feature.

Table 1
Baseline characteristics of the participants according to quartiles of different surrogate markers of insulin resistance

Table 2
Risk of CAD according to quartiles of surrogate markers of insulin resistance. Model 1: adjusted for age and sex. Model 2: model 1 plus systolic and diastolic blood pressure, total cholesterol, LDL, HDL, BMI, waist to hip ratio, family history of premature CAD, physical activity, and smoking*. *If any of these factors were included in exposure variables (surrogate insulin resistance indices), we excluded them from the adjustment process

Table 3
Risk of CAD according to quartiles of surrogate markers of insulin resistance stratified by gender. Model 1: adjusted for age and sex. Model 2: model 1 plus systolic and diastolic blood pressure, total cholesterol, LDL, HDL, BMI, waist to hip ratio, family history of premature CAD, physical activity, and smoking*. *If any of these factors were included in exposure variables (surrogate insulin resistance indices), we excluded them from the adjustment process

Table 4
Receiver operating characteristic curve and cut-off points of surrogate markers of insulin resistance for CAD prediction in men, women, and the total population